The stock market has consistently proven to be a good place to invest and save for the future. There are several compelling reasons to invest in stocks: they can help fight inflation, build wealth, and offer certain tax benefits. Steady returns compounded over a long period can grow far more than intuition suggests, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus available at retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to perform well under any market condition. A diversified portfolio tends to earn higher risk-adjusted returns, tempering potential losses when the market is down. However, it is easy to get lost in the sea of financial metrics used to assess a stock's worth, and repeating that analysis across a multitude of stocks to identify the right picks for an individual is tedious. Cluster analysis can identify groups of stocks that exhibit similar characteristics, as well as stocks that are minimally correlated with one another. This helps investors analyze stocks across market segments and guard against risks that would leave a portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. It has hired you as a Data Scientist and provided data comprising stock prices and some financial indicators for a number of companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.
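The correlation idea above can be sketched in a few lines. The snippet below uses synthetic daily returns for four hypothetical tickers (AAA–DDD are made-up names, not from this dataset) to show how a pairwise correlation matrix surfaces the least-correlated pair:

```python
import numpy as np
import pandas as pd

# Synthetic daily returns for four hypothetical tickers (illustrative only)
rng = np.random.default_rng(0)
base = rng.normal(0, 0.01, 250)  # a shared market-like factor
returns = pd.DataFrame({
    "AAA": base + rng.normal(0, 0.005, 250),   # moves with the factor
    "BBB": base + rng.normal(0, 0.005, 250),   # also moves with the factor
    "CCC": rng.normal(0, 0.01, 250),           # independent of the factor
    "DDD": -base + rng.normal(0, 0.005, 250),  # moves against the factor
})

corr = returns.corr()
mask = ~np.eye(len(corr), dtype=bool)          # ignore the diagonal (self-correlation)
least_correlated = corr.where(mask).stack().idxmin()
print("Least-correlated pair:", least_correlated)
```

Pairing minimally correlated holdings, as in the last line, is the mechanical core of diversification.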
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import pdist, cdist
# to perform k-means clustering, compute metric
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
#!pip install yellowbrick
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform PCA
from sklearn.decomposition import PCA
# to disable warnings
import warnings
warnings.filterwarnings('ignore')
# loading the dataset
data = pd.read_csv("stock_data.csv")
data.shape
(340, 15)
# viewing a random sample of the dataset
data.sample(n=10, random_state=1)
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102 | DVN | Devon Energy Corp. | Energy | Oil & Gas Exploration & Production | 32.000000 | -15.478079 | 2.923698 | 205 | 70 | 830000000 | -14454000000 | -35.55 | 4.065823e+08 | 93.089287 | 1.785616 |
| 125 | FB | Facebook | Information Technology | Internet Software & Services | 104.660004 | 16.224320 | 1.320606 | 8 | 958 | 592000000 | 3669000000 | 1.31 | 2.800763e+09 | 79.893133 | 5.884467 |
| 11 | AIV | Apartment Investment & Mgmt | Real Estate | REITs | 40.029999 | 7.578608 | 1.163334 | 15 | 47 | 21818000 | 248710000 | 1.52 | 1.636250e+08 | 26.335526 | -1.269332 |
| 248 | PG | Procter & Gamble | Consumer Staples | Personal Products | 79.410004 | 10.660538 | 0.806056 | 17 | 129 | 160383000 | 636056000 | 3.28 | 4.913916e+08 | 24.070121 | -2.256747 |
| 238 | OXY | Occidental Petroleum | Energy | Oil & Gas Exploration & Production | 67.610001 | 0.865287 | 1.589520 | 32 | 64 | -588000000 | -7829000000 | -10.23 | 7.652981e+08 | 93.089287 | 3.345102 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 4.353535e+08 | 17.682214 | -3.838260 |
| 112 | EQT | EQT Corporation | Energy | Oil & Gas Exploration & Production | 52.130001 | -21.253771 | 2.364883 | 2 | 201 | 523803000 | 85171000 | 0.56 | 1.520911e+08 | 93.089287 | 9.567952 |
| 147 | HAL | Halliburton Co. | Energy | Oil & Gas Equipment & Services | 34.040001 | -5.101751 | 1.966062 | 4 | 189 | 7786000000 | -671000000 | -0.79 | 8.493671e+08 | 93.089287 | 17.345857 |
| 89 | DFS | Discover Financial Services | Financials | Consumer Finance | 53.619999 | 3.653584 | 1.159897 | 20 | 99 | 2288000000 | 2297000000 | 5.14 | 4.468872e+08 | 10.431906 | -0.375934 |
| 173 | IVZ | Invesco Ltd. | Financials | Asset Management & Custody Banks | 33.480000 | 7.067477 | 1.580839 | 12 | 67 | 412000000 | 968100000 | 2.26 | 4.283628e+08 | 14.814159 | 4.218620 |
# copying the data to another variable to avoid any changes to original data
df = data.copy()
# checking datatypes and number of non-null values for each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS Sector                   340 non-null    object 
 3   GICS Sub Industry             340 non-null    object 
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash Ratio                    340 non-null    int64  
 9   Net Cash Flow                 340 non-null    int64  
 10  Net Income                    340 non-null    int64  
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
# Ticker Symbol is just an identifier for each stock and is not required for further analysis
# Dropping Ticker Symbol
df.drop("Ticker Symbol", axis=1, inplace=True)
# We should convert the object type columns to categories, which reduces the memory required to store the dataframe.
# convert all columns with dtype object into category
for col in df.columns[df.dtypes == "object"]:
    df[col] = df[col].astype("category")
# confirm conversions and dropped variable
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   Security                      340 non-null    category
 1   GICS Sector                   340 non-null    category
 2   GICS Sub Industry             340 non-null    category
 3   Current Price                 340 non-null    float64 
 4   Price Change                  340 non-null    float64 
 5   Volatility                    340 non-null    float64 
 6   ROE                           340 non-null    int64   
 7   Cash Ratio                    340 non-null    int64   
 8   Net Cash Flow                 340 non-null    int64   
 9   Net Income                    340 non-null    int64   
 10  Earnings Per Share            340 non-null    float64 
 11  Estimated Shares Outstanding  340 non-null    float64 
 12  P/E Ratio                     340 non-null    float64 
 13  P/B Ratio                     340 non-null    float64 
dtypes: category(3), float64(7), int64(4)
memory usage: 46.7 KB
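As an aside on why the category conversion saves memory (a toy column, not this dataset): pandas stores each distinct label once plus a small integer code per row, so low-cardinality string columns shrink dramatically.

```python
import pandas as pd

# 100,000 rows but only 11 distinct labels, mimicking a sector-like column
s_obj = pd.Series(["Sector %d" % (i % 11) for i in range(100_000)])
s_cat = s_obj.astype("category")

obj_bytes = s_obj.memory_usage(deep=True)  # full Python strings per row
cat_bytes = s_cat.memory_usage(deep=True)  # integer codes + one small lookup table
print(f"object: {obj_bytes:,} bytes -> category: {cat_bytes:,} bytes")
```

Note that Security has 340 unique values in 340 rows, so converting it buys little; the real savings come from GICS Sector (11 levels) and GICS Sub Industry (104 levels).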
# fixing column names: replacing spaces with underscores
df.columns = [c.replace(" ", "_") for c in df.columns]
# let's check for duplicate observations
df.duplicated().sum()
0
# let's check the total number of missing values
df.isnull().sum().sum()
0
# Let's look at the statistical summary of the data
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Security | 340 | 340 | 3M Company | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS_Sector | 340 | 11 | Industrials | 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS_Sub_Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Current_Price | 340.0 | NaN | NaN | NaN | 80.862345 | 98.055086 | 4.5 | 38.555 | 59.705 | 92.880001 | 1274.949951 |
| Price_Change | 340.0 | NaN | NaN | NaN | 4.078194 | 12.006338 | -47.129693 | -0.939484 | 4.819505 | 10.695493 | 55.051683 |
| Volatility | 340.0 | NaN | NaN | NaN | 1.525976 | 0.591798 | 0.733163 | 1.134878 | 1.385593 | 1.695549 | 4.580042 |
| ROE | 340.0 | NaN | NaN | NaN | 39.597059 | 96.547538 | 1.0 | 9.75 | 15.0 | 27.0 | 917.0 |
| Cash_Ratio | 340.0 | NaN | NaN | NaN | 70.023529 | 90.421331 | 0.0 | 18.0 | 47.0 | 99.0 | 958.0 |
| Net_Cash_Flow | 340.0 | NaN | NaN | NaN | 55537620.588235 | 1946365312.175789 | -11208000000.0 | -193906500.0 | 2098000.0 | 169810750.0 | 20764000000.0 |
| Net_Income | 340.0 | NaN | NaN | NaN | 1494384602.941176 | 3940150279.327936 | -23528000000.0 | 352301250.0 | 707336000.0 | 1899000000.0 | 24442000000.0 |
| Earnings_Per_Share | 340.0 | NaN | NaN | NaN | 2.776662 | 6.587779 | -61.2 | 1.5575 | 2.895 | 4.62 | 50.09 |
| Estimated_Shares_Outstanding | 340.0 | NaN | NaN | NaN | 577028337.75403 | 845849595.417695 | 27672156.86 | 158848216.1 | 309675137.8 | 573117457.325 | 6159292035.0 |
| P/E_Ratio | 340.0 | NaN | NaN | NaN | 32.612563 | 44.348731 | 2.935451 | 15.044653 | 20.819876 | 31.764755 | 528.039074 |
| P/B_Ratio | 340.0 | NaN | NaN | NaN | -1.718249 | 13.966912 | -76.119077 | -4.352056 | -1.06717 | 3.917066 | 129.064585 |
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; the star indicates the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # histogram; falls back to automatic binning when bins is None
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# selecting numerical columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()
# plotting a histogram-boxplot for every numerical column
for col in num_cols:
    histogram_boxplot(df, col, bins=50, kde=True, figsize=(10, 5))
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # mid-point of the bar along x
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate each bar with the label computed above

    plt.show()  # show the plot
labeled_barplot(df, "GICS_Sector", perc=True)
pd.crosstab(df.GICS_Sub_Industry, df.GICS_Sector).style.highlight_max(
color="lightgreen", axis=0
)
| GICS_Sector | Consumer Discretionary | Consumer Staples | Energy | Financials | Health Care | Industrials | Information Technology | Materials | Real Estate | Telecommunications Services | Utilities |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GICS_Sub_Industry | |||||||||||
| Advertising | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Aerospace & Defense | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Agricultural Products | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Air Freight & Logistics | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| Airlines | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| Alternative Carriers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Apparel, Accessories & Luxury Goods | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Application Software | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Asset Management & Custody Banks | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Auto Parts & Equipment | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Automobile Manufacturers | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Banks | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Biotechnology | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| Brewers | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Broadcasting & Cable TV | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Building Products | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Cable & Satellite | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Casinos & Gaming | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Computer Hardware | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Construction & Farm Machinery & Heavy Trucks | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| Construction Materials | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Consumer Electronics | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Consumer Finance | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Copper | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Data Processing & Outsourced Services | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Distributors | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Diversified Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 |
| Diversified Commercial Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Diversified Financial Services | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Drug Retail | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Electric Utilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| Electrical Components & Equipment | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Electronic Components | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Electronic Equipment & Instruments | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Environmental Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Fertilizers & Agricultural Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Financial Exchanges & Data | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Gold | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Health Care Distributors | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Equipment | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Facilities | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Supplies | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| Home Entertainment Software | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Home Furnishings | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Homebuilding | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hotels, Resorts & Cruise Lines | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Household Appliances | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Household Products | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Housewares & Specialties | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Human Resource & Employment Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| IT Consulting & Other Services | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| Industrial Conglomerates | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 |
| Industrial Gases | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Industrial Machinery | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| Industrial Materials | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Insurance Brokers | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Integrated Oil & Gas | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Integrated Telecommunications Services | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
| Internet & Direct Marketing Retail | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Internet Software & Services | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 0 |
| Investment Banking & Brokerage | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Leisure Products | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Life & Health Insurance | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Life Sciences Tools & Services | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Managed Health Care | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| Metal & Glass Containers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Motorcycle Manufacturers | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Multi-Sector Holdings | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Multi-line Insurance | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MultiUtilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| Networking Equipment | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Office REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Oil & Gas Equipment & Services | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Oil & Gas Exploration & Production | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Oil & Gas Refining & Marketing & Transportation | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Packaged Foods & Meats | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Paper Packaging | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Personal Products | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pharmaceuticals | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
| Property & Casualty Insurance | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Publishing | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 |
| Railroads | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Real Estate Services | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Regional Banks | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Research & Consulting Services | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Residential REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| Restaurants | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Retail REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| Semiconductor Equipment | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Semiconductors | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 |
| Soft Drinks | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Specialized REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| Specialty Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
| Specialty Retail | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Specialty Stores | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Steel | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Technology Hardware, Storage & Peripherals | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Technology, Hardware, Software and Supplies | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Thrifts & Mortgage Finance | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tires & Rubber | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tobacco | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Trucking | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Water Utilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
# Bivariate analysis
# Stock Price_Change Vs. GICS_Sector
plt.figure(figsize=(20,6))
sns.boxplot(data = df, y = "Price_Change", x = "GICS_Sector");
# Cash_Ratio Vs. GICS_Sector
plt.figure(figsize=(20,8))
sns.boxplot(data = df, y = "Cash_Ratio", x = "GICS_Sector");
# P/E_Ratio Vs. GICS_Sector
plt.figure(figsize=(20,8))
sns.boxplot(data = df, y = "P/E_Ratio", x = "GICS_Sector");
# check for correlations
plt.figure(figsize=(15, 7))
sns.heatmap(
df[num_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
# Pair-plot analysis
sns.pairplot(df[num_cols], diag_kind="kde");
# re-checking for duplicate observations before clustering
df.duplicated().sum()
0
# re-checking for missing values before clustering
df.isnull().sum().sum()
0
# scaling the data so that all features are on the same scale
sc = StandardScaler()
subset_scaled_df = pd.DataFrame(
sc.fit_transform(df.drop(["Security", "GICS_Sector", "GICS_Sub_Industry"], axis=1)),
columns=df.drop(["Security", "GICS_Sector", "GICS_Sub_Industry"], axis=1).columns,
)
subset_scaled_df.head()
| Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
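StandardScaler simply applies the z-score transform z = (x − μ) / σ column-wise, using the population standard deviation (ddof=0). A quick check on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # ddof=0, matching sklearn's default

print(np.allclose(scaled, manual))  # the two transforms agree
```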
# Pair-plot analysis
sns.pairplot(subset_scaled_df, diag_kind="kde");
clusters = range(1, 9)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(subset_scaled_df)
    prediction = model.predict(subset_scaled_df)
    distortion = (
        sum(np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1))
        / subset_scaled_df.shape[0]
    )
    meanDistortions.append(distortion)
    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.384499097487295
Number of Clusters: 3 	Average Distortion: 2.2683105560042285
Number of Clusters: 4 	Average Distortion: 2.179645269703779
Number of Clusters: 5 	Average Distortion: 2.126829442356109
Number of Clusters: 6 	Average Distortion: 2.0537507716340393
Number of Clusters: 7 	Average Distortion: 2.046245663604377
Number of Clusters: 8 	Average Distortion: 1.9893285845489175
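An equivalent and slightly simpler variant of the elbow computation uses KMeans's built-in `inertia_` (total within-cluster sum of squared distances) instead of the manual `cdist` average. A sketch on synthetic blobs, not the stock data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to the closest center

# inertia shrinks as k grows; the "elbow" is where the drop flattens out
print([round(i, 1) for i in inertias])
```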
# checking silhouette scores
sil_score = []
cluster_list = list(range(2, 10))

for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(subset_scaled_df)
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
For n_clusters = 2, silhouette score is 0.43969639509980457
For n_clusters = 3, silhouette score is 0.45797710447228496
For n_clusters = 4, silhouette score is 0.45434371948348606
For n_clusters = 5, silhouette score is 0.4328525384867999
For n_clusters = 6, silhouette score is 0.41439237473055257
For n_clusters = 7, silhouette score is 0.4265962625132064
For n_clusters = 8, silhouette score is 0.41923671063647955
For n_clusters = 9, silhouette score is 0.3678986004890057
plt.plot(cluster_list,sil_score)
plt.grid()
# Finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state = 1))
visualizer.fit(subset_scaled_df)
visualizer.show();
visualizer = SilhouetteVisualizer(KMeans(5, random_state = 1))
visualizer.fit(subset_scaled_df)
visualizer.show();
visualizer = SilhouetteVisualizer(KMeans(6, random_state = 1))
visualizer.fit(subset_scaled_df)
visualizer.show();
# Choosing k=4 clusters
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(subset_scaled_df)
KMeans(n_clusters=4, random_state=0)
df['K_means_segments'] = kmeans.labels_
subset_scaled_df['K_means_segments'] = kmeans.labels_
cluster_profile = df.groupby('K_means_segments').mean(numeric_only=True)  # numeric_only excludes the categorical columns
cluster_profile['count_in_each_segments'] = df.groupby('K_means_segments')['Security'].count().values
cluster_profile
| Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | count_in_each_segments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments | ||||||||||||
| 0 | 234.170932 | 13.400685 | 1.729989 | 25.600000 | 277.640000 | 1.554927e+09 | 1.572612e+09 | 6.045200 | 5.783163e+08 | 74.960824 | 14.402452 | 25 |
| 1 | 38.099260 | -15.370329 | 2.910500 | 107.074074 | 50.037037 | -1.594285e+08 | -3.887458e+09 | -9.473704 | 4.803986e+08 | 90.619220 | 1.342067 | 27 |
| 2 | 50.517273 | 5.747586 | 1.130399 | 31.090909 | 75.909091 | -1.072273e+09 | 1.483309e+10 | 4.154545 | 4.298827e+09 | 14.803577 | -4.552119 | 11 |
| 3 | 72.399112 | 5.066225 | 1.388319 | 34.620939 | 53.000000 | -1.404622e+07 | 1.482212e+09 | 3.621029 | 4.385338e+08 | 23.843656 | -3.358948 | 277 |
fig, axes = plt.subplots(3, 4, figsize=(20, 16))
fig.suptitle("Boxplot of numerical variables for each cluster", fontsize=20)
counter = 0
for ii in range(3):
    for jj in range(4):
        if counter < 11:
            sns.boxplot(
                ax=axes[ii, jj],
                y=subset_scaled_df[num_cols[counter]],
                x=subset_scaled_df["K_means_segments"],
            )
            counter = counter + 1
fig.tight_layout(pad=2.0)
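PCA was imported at the top of the notebook but not used; a common follow-up here is to project the scaled features onto two principal components to eyeball how well the clusters separate. A sketch on synthetic 11-dimensional data shaped like this problem (340 rows, 4 groups), rather than the actual dataframe:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Stand-in for the scaled feature matrix: 340 rows, 11 features, 4 groups
X, _ = make_blobs(n_samples=340, centers=4, n_features=11, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Project the 11-D feature space onto the 2 directions of greatest variance
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_.round(2))

# plt.scatter(coords[:, 0], coords[:, 1], c=labels) would then show the groups
```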
Cluster 0
Cluster 1
Cluster 3
Cluster 2
This cluster is very similar to Cluster 3 in its behavior, but with two striking differences:
- It contains securities with some of the highest average Net_Income, about 1.48e+10 (almost 10 times that of Cluster 3).
- It contains securities with some of the highest average Estimated_Shares_Outstanding, about 4.30e+09 (almost 10 times that of Cluster 3).
Clusters 3 and 2 are the safer clusters, with Cluster 2 containing the more exclusive securities. Clusters 0 and 1 hold riskier securities, the former historically high-performing and the latter low-performing.
# Comparing cluster vs. GICS_Sector
pd.crosstab(df.GICS_Sector, df.K_means_segments).style.highlight_max(color = 'lightgreen', axis = 0)
| K_means_segments | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| GICS_Sector | ||||
| Consumer Discretionary | 6 | 0 | 1 | 33 |
| Consumer Staples | 1 | 0 | 1 | 17 |
| Energy | 1 | 22 | 1 | 6 |
| Financials | 1 | 0 | 3 | 45 |
| Health Care | 9 | 0 | 2 | 29 |
| Industrials | 0 | 1 | 0 | 52 |
| Information Technology | 5 | 3 | 1 | 24 |
| Materials | 0 | 1 | 0 | 19 |
| Real Estate | 1 | 0 | 0 | 26 |
| Telecommunications Services | 1 | 0 | 2 | 2 |
| Utilities | 0 | 0 | 0 | 24 |
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(subset_scaled_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9304469769832865
Cophenetic correlation for Euclidean distance and complete linkage is 0.8559480642212798
Cophenetic correlation for Euclidean distance and average linkage is 0.946403836884538
Cophenetic correlation for Euclidean distance and weighted linkage is 0.7508819056084053
Cophenetic correlation for Chebyshev distance and single linkage is 0.9161627445317929
Cophenetic correlation for Chebyshev distance and complete linkage is 0.822502094153258
Cophenetic correlation for Chebyshev distance and average linkage is 0.9379218754329659
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9153206618543516
Cophenetic correlation for Mahalanobis distance and single linkage is 0.9348505176633238
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.6881861661402056
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9360657692078034
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8810701545336993
Cophenetic correlation for Cityblock distance and single linkage is 0.938373245895409
Cophenetic correlation for Cityblock distance and complete linkage is 0.8124007660644492
Cophenetic correlation for Cityblock distance and average linkage is 0.9168123859372297
Cophenetic correlation for Cityblock distance and weighted linkage is 0.866729262879581
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.946403836884538, which is obtained with Euclidean distance and average linkage
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
    Z = linkage(subset_scaled_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
    print("Cophenetic correlation for {} linkage is {}".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.9304469769832865
Cophenetic correlation for complete linkage is 0.8559480642212798
Cophenetic correlation for average linkage is 0.946403836884538
Cophenetic correlation for centroid linkage is 0.9494262703881242
Cophenetic correlation for ward linkage is 0.7436374975239648
Cophenetic correlation for weighted linkage is 0.7508819056084053
# printing the linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage".format(
high_cophenet_corr, high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.9494262703881242, which is obtained with centroid linkage
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(subset_scaled_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
HCmodel = AgglomerativeClustering(n_clusters=6, affinity="euclidean", linkage="average")
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=6)
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
cluster_profile = df.groupby("HC_Clusters").mean()
cluster_profile["count_in_each_segments"] = (
    df.groupby("HC_Clusters")["Security"].count().values
)
cluster_profile
| | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_segments | count_in_each_segments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_Clusters | |||||||||||||
| 0 | 77.287589 | 4.099730 | 1.518066 | 35.336336 | 66.900901 | -3.319732e+07 | 1.538075e+09 | 2.88527 | 5.605050e+08 | 32.441706 | -2.174921 | 2.630631 | 333 |
| 1 | 25.640000 | 11.237908 | 1.322355 | 12.500000 | 130.500000 | 1.675550e+10 | 1.365400e+10 | 3.29500 | 2.791829e+09 | 13.649696 | 1.508484 | 1.000000 | 2 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1.292500e+09 | -1.910650e+10 | -41.81500 | 5.195740e+08 | 60.748608 | 1.565141 | 1.000000 | 2 |
| 3 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 5.920000e+08 | 3.669000e+09 | 1.31000 | 2.800763e+09 | 79.893133 | 5.884467 | 0.000000 | 1 |
| 4 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1.671386e+09 | 2.551360e+09 | 50.09000 | 5.093552e+07 | 25.453183 | -1.052429 | 0.000000 | 1 |
| 5 | 276.570007 | 6.189286 | 1.116976 | 30.000000 | 25.000000 | 9.088500e+07 | 5.965410e+08 | 8.91000 | 6.695185e+07 | 31.040405 | 129.064585 | 0.000000 | 1 |
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(affinity='euclidean', n_clusters=4)
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
cluster_profile = df.groupby("HC_Clusters").mean()
cluster_profile["count_in_each_segments"] = (
    df.groupby("HC_Clusters")["Security"].count().values
)
cluster_profile
| | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_segments | count_in_each_segments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_Clusters | |||||||||||||
| 0 | 46.558126 | -11.798670 | 2.617878 | 178.750000 | 50.250000 | 4.349716e+07 | -3.197472e+09 | -7.785312 | 4.732895e+08 | 72.496532 | -0.780467 | 1.281250 | 32 |
| 1 | 71.846974 | 4.953643 | 1.392784 | 25.117216 | 53.831502 | 1.197788e+06 | 1.557674e+09 | 3.691044 | 4.439183e+08 | 23.583804 | -3.087957 | 2.996337 | 273 |
| 2 | 229.579357 | 14.049986 | 1.735216 | 25.423077 | 268.423077 | 1.712688e+09 | 1.981882e+09 | 5.946923 | 7.219242e+08 | 84.216911 | 13.114240 | 0.115385 | 26 |
| 3 | 46.672222 | 5.166566 | 1.079367 | 25.000000 | 58.333333 | -3.040667e+09 | 1.484844e+10 | 3.435556 | 4.564960e+09 | 15.596051 | -6.354193 | 2.000000 | 9 |
# let's see the names of the securities in each cluster
for cl in df["HC_Clusters"].unique():
    print(
        "The",
        df[df["HC_Clusters"] == cl]["Security"].nunique(),
        "Securities in cluster",
        cl,
        "are:",
    )
    print(df[df["HC_Clusters"] == cl]["Security"].unique())
    print("-" * 100, "\n")
The 273 Securities in cluster 1 are:
['American Airlines Group', 'AbbVie', 'Abbott Laboratories', 'Adobe Systems Inc', 'Archer-Daniels-Midland Co', ..., 'Xylem Inc.', 'Yum! Brands Inc', 'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis']
Length: 273
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
----------------------------------------------------------------------------------------------------
The 26 Securities in cluster 2 are:
['Analog Devices, Inc.', 'Alliance Data Systems', 'Alexion Pharmaceuticals', 'Amgen Inc', 'Amazon.com Inc', ..., 'TripAdvisor', 'Vertex Pharmaceuticals Inc', 'Waters Corporation', 'Wynn Resorts Ltd', 'Yahoo Inc.']
Length: 26
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
----------------------------------------------------------------------------------------------------
The 32 Securities in cluster 0 are:
['Allegion', 'Apache Corporation', 'Anadarko Petroleum Corp', 'Baker Hughes Inc', 'Chesapeake Energy', ..., 'S&P Global, Inc.', 'Southwestern Energy', 'Teradata Corp.', 'Williams Cos.', 'Cimarex Energy']
Length: 32
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
----------------------------------------------------------------------------------------------------
The 9 Securities in cluster 3 are:
['Citigroup Inc.', 'Ford Motor', 'JPMorgan Chase & Co.', 'Coca Cola Company', 'Pfizer Inc.', 'AT&T Inc', 'Verizon Communications', 'Wells Fargo', 'Exxon Mobil Corp.']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
----------------------------------------------------------------------------------------------------
fig, axes = plt.subplots(3, 4, figsize=(20, 16))
fig.suptitle('Boxplot of numerical variables for each cluster', fontsize=20)

counter = 0
for ii in range(3):
    for jj in range(4):
        if counter < 11:
            sns.boxplot(
                ax=axes[ii, jj],
                y=subset_scaled_df[num_cols[counter]],
                x=subset_scaled_df['HC_Clusters'],
            )
            counter = counter + 1

fig.tight_layout(pad=2.0)
Despite minor differences here and there, the groupings obtained with hierarchical clustering using Euclidean distance and Ward linkage are very similar to those obtained using K-Means clustering!
Cluster 0 of Hierarchical clustering
Cluster 1 of Hierarchical clustering
Cluster 2 of Hierarchical clustering
Cluster 3 of Hierarchical clustering
# Comparing cluster vs. GICS_Sector
pd.crosstab(df.GICS_Sector, df.HC_Clusters).style.highlight_max(color = 'lightgreen', axis = 0)
| HC_Clusters | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| GICS_Sector | ||||
| Consumer Discretionary | 1 | 32 | 6 | 1 |
| Consumer Staples | 2 | 15 | 1 | 1 |
| Energy | 23 | 6 | 0 | 1 |
| Financials | 1 | 44 | 1 | 3 |
| Health Care | 0 | 30 | 9 | 1 |
| Industrials | 2 | 51 | 0 | 0 |
| Information Technology | 2 | 24 | 7 | 0 |
| Materials | 1 | 19 | 0 | 0 |
| Real Estate | 0 | 26 | 1 | 0 |
| Telecommunications Services | 0 | 2 | 1 | 2 |
| Utilities | 0 | 24 | 0 | 0 |
To compare the two clustering techniques, we can look at their execution times, the agreement between their cluster assignments, and any differences or similarities in the cluster profiles obtained from each.
import time
# Initialize K-means clustering with 4 clusters
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters)
# Measure the execution time
start_time = time.time()
kmeans.fit(subset_scaled_df)
end_time = time.time()
# Calculate the execution time
execution_time = end_time - start_time
print(f"K-means clustering took {execution_time:.4f} seconds")
K-means clustering took 0.0859 seconds
# Measure the execution time for hierarchical clustering
start_time = time.time()
# Perform hierarchical clustering
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)
# End timing
end_time = time.time()
# Calculate the execution time
execution_time = end_time - start_time
print(f"Hierarchical clustering took {execution_time:.4f} seconds")
Hierarchical clustering took 0.0110 seconds
# Comparing Hierarchical cluster vs. K-means cluster
pd.crosstab(df.K_means_segments, df.HC_Clusters).style.highlight_max(color = 'lightgreen', axis = 0)
| HC_Clusters | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| K_means_segments | ||||
| 0 | 1 | 0 | 24 | 0 |
| 1 | 26 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 | 9 |
| 3 | 5 | 272 | 0 | 0 |
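The crosstab above shows near-diagonal agreement between the two labelings, up to a relabeling of the clusters. One way to quantify this agreement with a single number is the Adjusted Rand Index, which is invariant to label permutations (1.0 means identical partitions, values near 0 mean chance-level agreement). In this notebook one would pass `df["K_means_segments"]` and `df["HC_Clusters"]`; below is a toy illustration:

```python
from sklearn.metrics import adjusted_rand_score

# same grouping under different label names -> perfect agreement
a = [0, 0, 1, 1, 2, 2]
b = [1, 1, 0, 0, 2, 2]
print(adjusted_rand_score(a, b))  # 1.0

# one point moved to another group -> agreement drops below 1
c = [1, 1, 0, 2, 2, 2]
print(adjusted_rand_score(a, c))
```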
# PCA to reduce the data to two dimensions and visualize it to see how well-separated the clusters are
# setting the number of components to 2
pca = PCA(n_components=2)
# transforming data and storing results in a dataframe
X_reduced_pca = pca.fit_transform(subset_scaled_df)
reduced_df_pca = pd.DataFrame(
data=X_reduced_pca, columns=["Component 1", "Component 2"]
)
# checking the amount of variance explained
pca.explained_variance_ratio_.sum()
0.38933325456234846
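Two components capture only ~39% of the variance here, so the 2D scatter plots below give a rough view of cluster separation rather than a faithful one. A quick way to see how many components a given variance threshold would require is to look at the cumulative explained variance ratio; the sketch below uses random standardized data as a stand-in for `subset_scaled_df` (the 200×11 shape and the 70% threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 11))  # stand-in for the scaled feature matrix

# cumulative variance explained by the first k components, for all k
cum_var = np.cumsum(PCA().fit(X).explained_variance_ratio_)

# smallest number of components explaining at least 70% of the variance
n_70 = int(np.argmax(cum_var >= 0.70)) + 1
print(n_70, cum_var)
```

On the actual scaled stock data, re-fitting `PCA()` without `n_components` and inspecting `cum_var` in the same way would show how much structure the 2D projection leaves out.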
sns.scatterplot(
data=reduced_df_pca,
x="Component 1",
y="Component 2",
hue=df["K_means_segments"],
palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
<matplotlib.legend.Legend at 0x7b3af6a843d0>
sns.scatterplot(
data=reduced_df_pca,
x="Component 1",
y="Component 2",
hue=df["HC_Clusters"],
palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
<matplotlib.legend.Legend at 0x7b3af70f4c70>
Insights
Exploratory Data Analysis
Current_Price and Estimated_Shares_Outstanding are right skewed (with several positive outliers) across securities in all sectors
The Health Care and Financials sectors have seen some of the highest positive Price_Change values over the last 13 weeks, making them favorable to investors
The Information Technology and Financials sectors have some of the highest Cash_Ratios, making them more favorable than other sectors
The Real Estate sector shows the least variation in both Price_Change and Cash_Ratio across its securities, making it a safer investment choice
The Energy sector shows some of the highest variance in Price_Change across its securities, making it more volatile and riskier for investors. However, this sector also contains securities with high P/E_Ratios, indicating that investors are willing to pay more per dollar of earnings for a share of an Energy company than for securities in other sectors
Clustering Profiles
Out of the 340 securities in our data set, both clustering methods grouped 331 securities in the same way, with only 9 securities clustered differently. The sector composition of the clusters was similar across both algorithms, and the PCA projections for both clustering techniques showed similar cluster densities in 2D space
One cluster (25+ securities) was identified as very aggressive (and high performing), belonging predominantly to the Health Care sector, followed by Consumer Discretionary and Information Technology
Another cluster (25+ securities) was identified as very aggressive (but historically low performing), belonging predominantly to the Energy sector
Another cluster (~10 securities) was identified as moderately aggressive (and high performing), belonging predominantly to the Financials sector
Finally, a major cluster (270+ securities) was identified as a mildly aggressive and safe investment option. This cluster is diversified, with securities predominantly belonging to Industrials, followed by Financials, Consumer Discretionary, Real Estate, and Information Technology
Recommendations
Securities were segregated into 4 different clusters, identifying very aggressive (both high and low performing), moderately aggressive (and high performing), and mildly aggressive options. This segmentation helps split investments across diversified stocks, enabling one to maximize earnings under any market condition
However, it is important to keep in mind that the stock market is often volatile, and past indicators may not always predict future trends. Dynamic clustering (as more data is added each day) and the movement of stocks across cluster groups under changing market conditions need to be analyzed further to make better predictions